refactor(sandbox): move sandbox state management server-side (ENG-4287)#381
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aee0b2cd40
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…+ inspector Addresses Codex P2 feedback on #381. The defensive guards required an envdAccessToken (and non-null domain), but those are legitimately absent for secure:false sandboxes (envd is reachable without X-Access-Token) and domain can be null (the SDK falls back to the configured E2B domain) — regressing support that worked before this PR. - create-envd-sandbox: envdAccessToken is now optional. - sandbox.openTerminal: stop throwing when getFullInfo has no envdAccessToken; pass it through as-is. - inspect context: drop the !envdAccessToken || !domain early-return (keep the killed-state narrowing guard). - test: cover connecting to a tokenless sandbox.
|
Addressed both Codex P2 comments in 37592de: @codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 37592def2a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
… (ENG-4287) Move sandbox control-plane state management to the server so account tokens never reach the browser, and add the ability to pause and resume a sandbox from the dashboard. - Run sandbox connect/resume and pause server-side via tRPC mutations using the E2B SDK; the terminal and inspector use sandbox-scoped envd credentials. - Add a Pause button next to Kill in the sandbox header, shown only while the sandbox is running, with a confirmation popover. - Resume a paused sandbox from the inspect Filesystem/Terminal empty states, noting the resume window and that it can be paused again. - Prevent a paused sandbox from auto-resuming: optimistically flip state to paused so the live terminal PTY and filesystem watcher tear down before the pause snapshot runs, instead of keeping the sandbox alive with envd traffic. - Show the "Sandbox Paused" empty state for both the filesystem and terminal instead of stale cached data. - Support tokenless (secure:false) sandboxes in the terminal and inspector, and guard against accidental resume / TTL extension while inspecting.
b78f6d8 to
b8db864
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b8db86413c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| queryClient.setQueryData(detailsKey, (old) => | ||
| old?.state === 'running' ? { ...old, state: 'paused' as const } : old |
There was a problem hiding this comment.
Prevent terminal cleanup from re-resuming paused sandboxes
When pausing from the sandbox terminal tab, this optimistic paused cache update unmounts DashboardTerminal immediately; its unmount cleanup calls requestPtyKill, and sandbox.killTerminalPty still performs Sandbox.connect(... timeoutMs: TERMINAL_SANDBOX_TIMEOUT_MS) before killing the PTY. If that keepalive request lands after the pause snapshot completes, it auto-resumes the sandbox and extends it for 30 minutes, so the new Pause action can appear to succeed while leaving an active sandbox running.
Useful? React with 👍 / 👎.
| sandboxConnectRequestTimeoutMs={ | ||
| shouldResumeSandbox ? SANDBOX_TERMINAL_RESUME_TIMEOUT_MS : undefined | ||
| } | ||
| sandboxManagementAuth={sandboxManagementAuth} | ||
| sandboxScoped | ||
| teamSlug={team.slug} |
There was a problem hiding this comment.
🔴 The terminal-tab Resume button gives the sandbox a 30-minute TTL while the empty-state copy on the same screen promises "Runs for 70s" — a ~26× silent TTL extension, exactly the anti-pattern this PR's own inspect/constants.ts comment says it is trying to prevent. The filesystem-tab Resume correctly routes through sandbox.resume (70s TTL); only the terminal-tab Resume bypasses it by going through sandbox.openTerminal (TERMINAL_SANDBOX_TIMEOUT_MS = 30 min). Fix by routing the terminal-tab resume through sandbox.resume first (or by adding a resume mode to sandbox.openTerminal that honors SANDBOX_RESUME_TIMEOUT_MS) before attaching the PTY.
Extended reasoning...
Terminal-tab Resume silently extends sandbox TTL to 30 minutes
What the bug is
SandboxInspectNotFound is rendered on both the filesystem and terminal tabs for paused sandboxes and renders the unconditional message:
// inspect/not-found.tsx
<>
Resume this sandbox to access the {resourceName}.
<span className="mt-1 block text-fg-tertiary">
Runs for {RESUME_TIMEOUT_SECONDS}s, then stops unless the timeout is
extended. You can pause it again anytime.
</span>
</>where RESUME_TIMEOUT_SECONDS = Math.round(SANDBOX_RESUME_TIMEOUT_MS / 1000) = 70. So the user is told the resumed sandbox will run for 70 seconds.
On the filesystem tab this promise is honest: resumeSandbox() in inspect/context.tsx calls trpcClient.sandbox.resume.mutate(...), whose server mutation does Sandbox.connect(... timeoutMs: SANDBOX_RESUME_TIMEOUT_MS /* 70_000 */).
On the terminal tab, the same Resume button takes a completely different path that ignores SANDBOX_RESUME_TIMEOUT_MS entirely.
Code path that triggers it
Step-by-step trace from src/features/dashboard/sandbox/terminal/view.tsx:
- Sandbox is paused →
SandboxInspectNotFoundis shown with the "Runs for 70s" copy. - User clicks Resume sandbox →
onResumeSandboxrunssetShouldResumeSandbox(true). - With
shouldResumeSandbox=true, the empty-state branch at lines 67–80 is bypassed andDashboardTerminalis mounted at lines 84–101 withautoStart,launchTarget={sandboxId, template}, andsandboxConnectRequestTimeoutMs={SANDBOX_TERMINAL_RESUME_TIMEOUT_MS}(= 70_000). autoStarttriggersqueueTerminalCommand→startTerminal→openTerminalSandbox→acquireTerminalSandbox→trpcClient.sandbox.openTerminal.mutate({ teamSlug, template, sandboxId, requestTimeoutMs: 70_000 }).- The
openTerminalmutation insrc/core/server/api/routers/sandbox.tshits theif (sandboxId)branch:
const sandbox = await Sandbox.connect(sandboxId, {
...connectionOpts,
timeoutMs: TERMINAL_SANDBOX_TIMEOUT_MS, // 30 * 60 * 1000
requestTimeoutMs, // 70_000 — HTTP timeout only
})Sandbox.connect's timeoutMs is the SDK's sandbox-TTL parameter — exactly the field the inspect-side comment in constants.ts warns about. requestTimeoutMs is only the HTTP-call timeout, not the sandbox lifetime. So clicking Resume from the terminal tab grants the sandbox a 30-minute TTL.
Why existing code doesn't prevent it
The PR added SANDBOX_RESUME_TIMEOUT_MS and the sandbox.resume mutation specifically to avoid this, with an explicit comment:
Kept short on purpose: resuming for inspect should give a brief debug window, not silently extend a customer sandbox's lifetime.
The filesystem path in inspect/context.tsx calls sandbox.resume. But the terminal tab's Resume button never reaches sandbox.resume — instead it mounts DashboardTerminal, which only knows how to call sandbox.openTerminal, and that mutation hard-codes TERMINAL_SANDBOX_TIMEOUT_MS. There is no resume-aware code path on the terminal side.
Impact
A customer sandbox that was paused (possibly by the new Pause button itself) can be silently revived for 30 minutes of billable lifetime every time a developer opens its terminal tab and clicks the Resume button — while the UI on the very same screen tells them it will run for 70 seconds. This is a 26× discrepancy between the displayed lifetime and the actual lifetime, on the exact action the PR description claims has been hardened against this behaviour.
Step-by-step proof
- Open a sandbox-details page for a paused sandbox at
/dashboard/{teamSlug}/sandboxes/{sandboxId}/terminal. - The terminal tab renders
SandboxInspectNotFound→ message: "Resume this sandbox to access the terminal. Runs for 70s, then stops unless the timeout is extended." - Click Resume sandbox.
setShouldResumeSandbox(true)causesDashboardTerminalto mount withlaunchTarget={sandboxId, template}andsandboxConnectRequestTimeoutMs=70_000.DashboardTerminalauto-starts →openTerminalSandbox(... sandboxId, requestTimeoutMs: 70_000 ...)→trpcClient.sandbox.openTerminal.mutate({ ..., sandboxId, requestTimeoutMs: 70_000 }).- Server-side:
Sandbox.connect(sandboxId, { ..., timeoutMs: 30 * 60 * 1000, requestTimeoutMs: 70_000 })— sandbox is resumed with a 30-minute TTL. - UI promise vs. reality: shown 70s, actual 1800s.
How to fix
Two viable options:
-
Two-step on the client (preferred — matches filesystem flow): In
SandboxTerminalView, when the user clicks Resume, calltrpcClient.sandbox.resume.mutate({ sandboxId, requestTimeoutMs: SANDBOX_RESUME_TIMEOUT_MS })first, then mountDashboardTerminal(which now just attaches a PTY to an already-running sandbox). -
Resume-aware
openTerminal: Add amode: 'resume' | 'attach'(or boolean) input to thesandbox.openTerminalmutation; in resume mode passtimeoutMs: SANDBOX_RESUME_TIMEOUT_MStoSandbox.connectinstead ofTERMINAL_SANDBOX_TIMEOUT_MS, and haveDashboardTerminalplumb the resume flag fromSandboxTerminalView.
Option 1 keeps sandbox.resume as the single source of truth for the resume-TTL invariant and makes the terminal-tab flow structurally identical to the filesystem-tab flow.
Moves sandbox state management server-side so the user's account-level access token is never exposed to browser JS (ENG-4287), and adds the ability to pause and resume a sandbox from the dashboard. Previously the sandbox-details surfaces drove the E2B SDK client-side with
sandboxManagementAuth(the account token); now the browser only ever receives sandbox-scoped envd credentials.Server-side state management
createEnvdSandboxhelper from theenvdAccessTokenalready returned by thesandbox.detailsquery — no control-plane call, so normal inspect never resumes a paused sandbox or extends its TTL./dashboard/terminaland the sandbox-details terminal tab) runs create/connect through asandbox.openTerminaltRPC mutation; the client receives only scoped envd creds.sandbox.resumetRPC mutation server-side (shortSANDBOX_RESUME_TIMEOUT_MSwindow), then the client rebuilds the envd-only client from the returned creds. Resuming/extending TTL only happens on this explicit user action — never implicitly.secure: false) sandboxes are supported end-to-end (envdAccessTokenoptional throughout).sandbox-management-authmodules are deleted;getFullInfo/sandbox.detailsremain read-only.Pause / resume
sandbox.pausetRPC mutation using the SDK'sSandbox.pause()(consistent withsandbox.resume), so the account token stays off the client.memory: truepause can be auto-resumed by any envd traffic, so the live terminal PTY and filesystem watcher would immediately revive the sandbox. The Pause button now optimistically flips state topaused, tearing those connections down before the snapshot runs (rolled back on error); without the live traffic the snapshot completes and the sandbox stays paused.Notes
origin/mainmerged in — reconciled main's resume UX and details terminal tab onto the no-token model, and adopted main's removal of the client-side attach-retry.Tests: typecheck/lint clean on changed files.
Not yet verified live: on the preview deploy, (1) confirm via DevTools that envd requests carry only
X-Access-Token— no accountAuthorization: Bearer/X-Supabase-Token(incl. an explicit Resume and asecure: falsesandbox); (2) from the Terminal and Filesystem tabs, click Pause and confirm the sandbox shows the paused screen and does not auto-resume.Known follow-up (pre-existing, from main):
sandbox.killTerminalPtystill does a control-planeSandbox.connect(resume + extend TTL) just to kill a PTY on terminal close; it should become envd-only to avoid that side effect.